ASYMPTOTICS OF INPUT-CONSTRAINED BINARY
SYMMETRIC CHANNEL CAPACITY 

By Guangyue Han 1 and Brian Marcus 

University of Hong Kong and University of British Columbia 

We study the classical problem of noisy constrained capacity in 
the case of the binary symmetric channel (BSC), namely, the capacity 
of a BSC whose inputs are sequences chosen from a constrained set. 
Motivated by a result of Ordentlich and Weissman [In Proceedings 
of IEEE Information Theory Workshop (2004) 117-122], we derive 
an asymptotic formula (when the noise parameter is small) for the 
entropy rate of a hidden Markov chain, observed when a Markov chain 
passes through a BSC. Using this result, we establish an asymptotic 
formula for the capacity of a BSC with input process supported on 
an irreducible finite type constraint, as the noise parameter tends to zero.



1. Introduction and background. Let $X, Y$ be discrete random variables
with alphabets $\mathcal{X}, \mathcal{Y}$ and joint probability mass function $p_{X,Y}(x,y) = P(X = x, Y = y)$,
$x \in \mathcal{X}$, $y \in \mathcal{Y}$ [for notational simplicity, we will write $p(x,y)$ rather
than $p_{X,Y}(x,y)$, similarly $p(x), p(y)$ rather than $p_X(x), p_Y(y)$, resp., when it
is clear from the context]. The entropy $H(X)$ of the discrete random variable
$X$, which measures the level of uncertainty of $X$, is defined as (in this paper
log is taken to mean the natural logarithm)

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$

The conditional entropy $H(Y|X)$, which measures the level of uncertainty
of $Y$ given $X$, is defined as

$$H(Y|X) = \sum_{x \in \mathcal{X}} p(x) H(Y|X = x) = -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log p(y|x) = -\sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x,y) \log p(y|x).$$

The definitions above naturally include the case when $X, Y$ are vector-valued
variables, for example, $X = X_k^l = (X_k, X_{k+1}, \ldots, X_l)$, a sequence of discrete
random variables.

For a left-infinite discrete stationary stochastic process $X = X_{-\infty}^0 = \{X_i : i = 0, -1, -2, \ldots\}$, the entropy rate of $X$ is defined to be

$$H(X) = \lim_{n \to \infty} \frac{1}{n+1} H(X_{-n}^0), \tag{1.1}$$

where $H(X_{-n}^0)$ denotes the entropy of the vector-valued random variable
$X_{-n}^0$. Given another stationary process $Y = Y_{-\infty}^0$, we similarly define the
conditional entropy rate

$$H(Y|X) = \lim_{n \to \infty} \frac{1}{n+1} H(Y_{-n}^0 | X_{-n}^0). \tag{1.2}$$

A simple monotonicity argument on page 64 of [8] shows the existence of the
limit in (1.1). Using the chain rule for entropy (see page 21 of [8]), we obtain

$$H(Y_{-n}^0 | X_{-n}^0) = H(X_{-n}^0, Y_{-n}^0) - H(X_{-n}^0),$$

and so we can apply the same argument to the processes $(X, Y)$ and $X$ to
obtain the limit in (1.2).

If $Y = Y_{-\infty}^0$ is a stationary finite-state Markov chain, then $H(Y)$ has a
simple analytic form. Specifically, denoting by $A$ the transition probability
matrix of $Y$, we have

$$H(Y) = H(Y_0 | Y_{-1}) = -\sum_{i,j} P(Y_0 = i) A(i,j) \log A(i,j). \tag{1.3}$$
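Formula (1.3) can be evaluated directly once the stationary distribution is known. The following is a minimal sketch, not part of the original paper: the function names `stationary_distribution` and `markov_entropy_rate` are hypothetical, the chain is assumed irreducible, and the stationary vector is approximated by a simple averaged iteration rather than an exact linear solve.

```python
import math

def stationary_distribution(A, iters=2000):
    """Approximate the row vector pi with pi A = pi.

    Uses the lazy update pi <- (pi + pi A) / 2, which has the same fixed
    point and also converges for periodic irreducible chains.
    """
    n = len(A)
    pi = [1.0 / n] * n
    for _ in range(iters):
        step = [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
        pi = [(a + b) / 2.0 for a, b in zip(pi, step)]
    return pi

def markov_entropy_rate(A):
    """Evaluate (1.3) in nats; terms with A(i, j) = 0 contribute zero."""
    pi = stationary_distribution(A)
    n = len(A)
    return -sum(
        pi[i] * A[i][j] * math.log(A[i][j])
        for i in range(n)
        for j in range(n)
        if A[i][j] > 0.0
    )
```

For the two-state chain with rows (1/2, 1/2) and (1, 0), the stationary distribution is (2/3, 1/3), so the sketch should return (2/3) log 2.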

A function $Z = Z_{-\infty}^0$ of the stationary Markov chain $Y$ with the form $Z_i = \Phi(Y_i)$
is called a hidden Markov chain; here $\Phi$ is a function defined on the
alphabet of $Y_i$, taking values in the alphabet of $Z_i$. We often write $Z = \Phi(Y)$.
Hidden Markov chains are typically not Markov.

For a hidden Markov chain Z, the entropy rate H(Z) was studied by 
Blackwell [6] as early as 1957, where the analysis suggested the intrinsic 
complexity of H(Z) as a function of the process parameters. He gave an 
expression for $H(Z)$ in terms of a measure $Q$ on a simplex, obtained by
solving an integral equation dependent on the parameters of the process. 
However, the measure is difficult to extract from the equation in any explicit 
way, and the entropy rate is difficult to compute. 

Recently, the problem of computing the entropy rate of a hidden Markov 
chain has drawn much interest, and many approaches have been adopted 
to tackle this problem. These include asymptotic expansions as Markov
chain parameters tend to extremes [14, 17, 18, 22, 23, 34, 35], analyticity
results [13], variations on a classical bound [9] and efficient Monte Carlo
methods [2, 27, 31]; and connections with the top Lyapunov exponent of
a random matrix product have been observed [11, 15, 16, 17], relating to
earlier work on Lyapunov exponents [4, 25, 26, 28].

Of particular interest are hidden Markov chains which arise as output
processes of noisy channels. For example, the binary symmetric channel with
crossover probability $\varepsilon$ [denoted BSC($\varepsilon$)] is an object which transforms input
processes to output processes by means of a fixed i.i.d. binary noise process
$E = \{E_n\}$ with $p_{E_n}(0) = 1 - \varepsilon$ and $p_{E_n}(1) = \varepsilon$. Specifically, given an arbitrary
binary input process $X = \{X_n\}$, which is independent of $E$, define at time $n$,

$$Z_n(\varepsilon) = X_n \oplus E_n,$$

where $\oplus$ denotes binary addition modulo 2; then $Z_\varepsilon = \{Z_n(\varepsilon)\}$ is the output
process corresponding to $X$.
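The channel action $Z_n(\varepsilon) = X_n \oplus E_n$ is easy to simulate. The sketch below is illustrative only: the function name `bsc` and its optional `rng` parameter are assumptions made here, and the noise bits are drawn with Python's standard pseudo-random generator.

```python
import random

def bsc(x, eps, rng=None):
    """Pass the bit list x through a BSC(eps): flip each bit independently w.p. eps."""
    rng = rng or random.Random(0)
    # E_n = 1 with probability eps, independently of the input and of other times.
    noise = [1 if rng.random() < eps else 0 for _ in x]
    # Binary addition modulo 2 is XOR.
    return [xi ^ ei for xi, ei in zip(x, noise)]
```

With `eps = 0` the output equals the input, and with `eps = 1` every bit is complemented; intermediate values give an output that is a noisy copy of the input.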

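When the input $X$ is a stationary Markov chain, the output $Z_\varepsilon$ can be
viewed as a hidden Markov chain by appropriately augmenting the state
space of $X$ [10]; specifically, in the case that $X$ is a first order binary Markov
chain with transition probability matrix

$$\Pi = \begin{bmatrix} \pi_{00} & \pi_{01} \\ \pi_{10} & \pi_{11} \end{bmatrix},$$

then $Y_\varepsilon = \{Y_n(\varepsilon)\} = \{(X_n, E_n)\}$ is jointly Markov with transition probability
matrix

$$A = \begin{array}{c|cccc}
y & (0,0) & (0,1) & (1,0) & (1,1) \\ \hline
(0,0) & \pi_{00}(1-\varepsilon) & \pi_{00}\varepsilon & \pi_{01}(1-\varepsilon) & \pi_{01}\varepsilon \\
(0,1) & \pi_{00}(1-\varepsilon) & \pi_{00}\varepsilon & \pi_{01}(1-\varepsilon) & \pi_{01}\varepsilon \\
(1,0) & \pi_{10}(1-\varepsilon) & \pi_{10}\varepsilon & \pi_{11}(1-\varepsilon) & \pi_{11}\varepsilon \\
(1,1) & \pi_{10}(1-\varepsilon) & \pi_{10}\varepsilon & \pi_{11}(1-\varepsilon) & \pi_{11}\varepsilon
\end{array}$$

and $Z_\varepsilon = \{Z_n(\varepsilon)\}$ is a hidden Markov chain with $Z_n(\varepsilon) = \Phi(Y_n(\varepsilon))$, where
$\Phi$ maps states $(0,0)$ and $(1,1)$ to 0 and maps states $(0,1)$ and $(1,0)$ to 1.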
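The augmented chain can be written down mechanically: the entry from state $(x, e)$ to state $(x', e')$ is the input transition probability from $x$ to $x'$ times the probability of the noise bit $e'$. The following sketch builds that matrix for a first order binary input; the names `joint_transition_matrix` and `phi` are hypothetical and chosen only for this illustration.

```python
def joint_transition_matrix(Pi, eps):
    """Transition matrix of Y_n = (X_n, E_n) for a first order binary input chain Pi."""
    states = [(x, e) for x in (0, 1) for e in (0, 1)]  # (0,0), (0,1), (1,0), (1,1)
    noise = {0: 1.0 - eps, 1: eps}
    # The next noise bit is independent of everything, so rows depend only on x.
    A = [[Pi[x][x2] * noise[e2] for (x2, e2) in states] for (x, _e) in states]
    return states, A

def phi(state):
    """Output map: Z_n = X_n XOR E_n."""
    x, e = state
    return x ^ e
```

Each row of the resulting matrix sums to one, and the rows for $(0,0)$ and $(0,1)$ coincide because the current noise bit does not influence the next state.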

In Section 2 we give asymptotics for the entropy rate of a hidden Markov
chain, obtained by passing a binary Markov chain, of arbitrary order, through
BSC($\varepsilon$) as the noise $\varepsilon$ tends to zero. In Section 2.1 we review, from [18], the
result when the transition probabilities are strictly positive. In Section 2.2
we develop the formula when some transition probabilities are zero (which
is our main focus), thereby generalizing a specific result from [23].

The remainder of the paper is devoted to asymptotics for noisy constrained
channel capacity. The capacity of the (unconstrained) BSC($\varepsilon$) is
defined as

$$C(\varepsilon) = \lim_{n \to \infty} \sup_{X_{-n}^0} \frac{1}{n+1} \left( H(Z_{-n}^0(\varepsilon)) - H(Z_{-n}^0(\varepsilon) | X_{-n}^0) \right); \tag{1.4}$$

here $X_{-n}^0$ is a finite-length input process from time $-n$ to $0$ and $Z_{-n}^0(\varepsilon)$
is the corresponding output process. Seminal results of information theory,
due to Shannon [30], include the following: (1) the capacity is the optimal
rate of transmission possible with arbitrarily small probability of error, and
(2) the capacity can be explicitly computed: $C(\varepsilon) = 1 - H(\varepsilon)$, where $H(\varepsilon)$
is the binary entropy function defined as

$$H(\varepsilon) = \varepsilon \log 1/\varepsilon + (1 - \varepsilon) \log 1/(1 - \varepsilon).$$

Generally speaking, it is very difficult to calculate the capacity of a generic 
channel. For a discrete memoryless channel without input-constraints, the 
Blahut-Arimoto algorithm [1, 7] can be applied to approximate the capacity 
numerically. A generalized Blahut-Arimoto algorithm has been proposed 
to numerically compute the local maximum mutual information rate of a 
finite state machine channel [32]. We are interested in input-constrained
channel capacity, i.e., the capacity of BSC($\varepsilon$), where the possible inputs are
constrained, described as follows. 

Let $\mathcal{X} = \{0,1\}$, $\mathcal{X}^*$ denote all the finite length binary words, and $\mathcal{X}^n$
denote all the binary words with length $n$. A binary finite type constraint [20,
21] $\mathcal{S}$ is a subset of $\mathcal{X}^*$ defined by a finite set (denoted by $\mathcal{F}$) of forbidden
words; in other words, any element in $\mathcal{S}$ does not contain any element in
$\mathcal{F}$ as a contiguous subsequence. A prominent example is the $(d,k)$-RLL
constraint $\mathcal{S}(d,k)$, which forbids any sequence with fewer than $d$ or more
than $k$ consecutive zeros in between two 1's. For $\mathcal{S}(d,k)$ with $k < \infty$, a
forbidden set $\mathcal{F}$ is:

$$\mathcal{F} = \{1\underbrace{0 \cdots 0}_{l}1 : 0 \le l < d\} \cup \{\underbrace{0 \cdots 0}_{k+1}\}.$$

When $k = \infty$, one can choose $\mathcal{F}$ to be

$$\mathcal{F} = \{1\underbrace{0 \cdots 0}_{l}1 : 0 \le l < d\};$$

in particular, when $d = 1, k = \infty$, $\mathcal{F}$ can be chosen to be $\{11\}$. These constraints
on input sequences arise in magnetic recording in order to eliminate
the most damaging error events [21].
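The forbidden-word description translates directly into a membership test. The sketch below is an illustration rather than anything from the paper: `rll_forbidden_words` and `allowed_words` are hypothetical names, `k=None` is used here to stand for $k = \infty$, and words are enumerated by brute force, which is only practical for short lengths.

```python
from itertools import product

def rll_forbidden_words(d, k=None):
    """Forbidden set for the (d, k)-RLL constraint; k=None stands for k = infinity."""
    words = ["1" + "0" * l + "1" for l in range(d)]  # fewer than d zeros between two 1's
    if k is not None:
        words.append("0" * (k + 1))                  # more than k consecutive zeros
    return words

def allowed_words(n, forbidden):
    """All binary words of length n that contain no forbidden word as a contiguous block."""
    out = []
    for bits in product("01", repeat=n):
        w = "".join(bits)
        if not any(f in w for f in forbidden):
            out.append(w)
    return out
```

For the $(1,\infty)$ constraint the forbidden set is `["11"]`, and the number of allowed words of length 1, 2, 3, 4 is 2, 3, 5, 8.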

We will use $\mathcal{S}_n$ to denote the subset of $\mathcal{S}$ consisting of words with length
$n$. A finite type constraint $\mathcal{S}$ is irreducible if for any $u, v \in \mathcal{S}$, there is a $w \in \mathcal{S}$
such that $uwv \in \mathcal{S}$.

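For a finite binary stochastic (not necessarily stationary) process $X = X_{-n}^0$,
define the set of allowed words with respect to $X$ as

$$\mathcal{A}(X_{-n}^0) = \{w_{-n}^0 \in \mathcal{X}^{n+1} : P(X_{-n}^0 = w_{-n}^0) > 0\}.$$

For a left-infinite binary stochastic (again not necessarily stationary) process
$X = X_{-\infty}^0$, define the set of allowed words with respect to $X$ as

$$\mathcal{A}(X) = \{w_{-m}^0 \in \mathcal{X}^* : m \ge 0,\ P(X_{-m}^0 = w_{-m}^0) > 0\}.$$

For a constrained BSC($\varepsilon$) with input sequences in $\mathcal{S}$, the noisy constrained
capacity $C(\mathcal{S}, \varepsilon)$ is defined as

$$C(\mathcal{S}, \varepsilon) = \lim_{n \to \infty} \sup_{\mathcal{A}(X_{-n}^0) \subseteq \mathcal{S}} \frac{1}{n+1} \left( H(Z_{-n}^0(\varepsilon)) - H(Z_{-n}^0(\varepsilon) | X_{-n}^0) \right),$$

where again $Z_{-n}^0(\varepsilon)$ is the output process corresponding to the input process
$X_{-n}^0$. Let $\mathcal{P}$ (resp. $\mathcal{P}_n$) denote the set of all left-infinite (resp. length $n$)
stationary processes over the alphabet $\mathcal{X}$. Using the approach in Section
12.4 of [12], one can show that

$$C(\mathcal{S}, \varepsilon) = \lim_{n \to \infty} \sup_{X_{-n}^0 \in \mathcal{P}_{n+1},\ \mathcal{A}(X_{-n}^0) \subseteq \mathcal{S}} \frac{1}{n+1} \left( H(Z_{-n}^0(\varepsilon)) - H(Z_{-n}^0(\varepsilon) | X_{-n}^0) \right) = \sup_{X \in \mathcal{P},\ \mathcal{A}(X) \subseteq \mathcal{S}} H(Z_\varepsilon) - H(Z_\varepsilon | X), \tag{1.5}$$

where $Z_{-n}^0(\varepsilon)$, $Z_\varepsilon$ are the output processes corresponding to the input processes
$X_{-n}^0$, $X$, respectively.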

In Section 3 we apply the results of Section 2 to derive an asymptotic
formula for capacity of the input-constrained BSC($\varepsilon$) (again as $\varepsilon$ tends to
zero) for any irreducible finite type input constraint. In Section 4 we consider
the special case of the $(d,k)$-RLL constraint, and compute the coefficients
of the asymptotic formulas.

Regarding prior work on $C(\mathcal{S}, \varepsilon)$, the best results in the literature have
been in the form of bounds and numerical simulations based on producing 
random (and, hopefully, typical) channel output sequences (see, e.g., [3, 29, 
33] and references therein) . These methods allow for fairly precise numerical 
approximations of the capacity for given constraints and channel parameters. 

For a more detailed introduction to entropy, capacity and related concepts 
in information theory, we refer to standard textbooks such as [8, 12]. 

2. Asymptotics of entropy rate. Consider a BSC($\varepsilon$) and suppose the
input is an $m$th order irreducible Markov chain $X$ defined by the transition
probabilities $P(X_0 = a_0 | X_{-m}^{-1} = a_{-m}^{-1})$, $a_{-m}^0 \in \mathcal{X}^{m+1}$; here again $\mathcal{X} = \{0, 1\}$,
and the output hidden Markov chain will be denoted by $Z_\varepsilon$.

2.1. When transition probabilities of X are all positive. This case is 
treated in [18]: 
Theorem 2.1 ([18], Theorem 3). If $P(X_0 = a_0 | X_{-m}^{-1} = a_{-m}^{-1}) > 0$ for all
$a_{-m}^0 \in \mathcal{X}^{m+1}$, the entropy rate of $Z_\varepsilon$ for small $\varepsilon$ is

$$H(Z_\varepsilon) = H(X) + g(X)\varepsilon + O(\varepsilon^2), \tag{2.1}$$

where, denoting by $\bar{z}_i$ the Boolean complement of $z_i$, and

$$\check{z}_1^{2m+1} = z_1 \cdots z_m \bar{z}_{m+1} z_{m+2} \cdots z_{2m+1},$$

we have

$$g(X) = \sum_{z_1^{2m+1} \in \mathcal{X}^{2m+1}} p_X(z_1^{2m+1}) \log \frac{p_X(z_1^{2m+1})}{p_X(\check{z}_1^{2m+1})}. \tag{2.2}$$

We remark that the expression here for $g(X)$ is a familiar quantity in information
theory, known as the Kullback-Leibler divergence; specifically, $g(X)$
is the divergence between the two distributions $p_X(z_1^{2m+1})$ and $p_X(\check{z}_1^{2m+1})$.
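For a first order chain ($m = 1$) the sum in (2.2) runs over the eight words $z_1 z_2 z_3$, and the modified word flips the middle bit. The sketch below evaluates that divergence; it is an illustration under stated assumptions (first order binary chain, all transition probabilities positive), and the name `g_first_order` is hypothetical.

```python
import math
from itertools import product

def g_first_order(P):
    """Coefficient g(X) of (2.2) for a first order binary chain with positive matrix P."""
    if any(P[a][b] <= 0.0 for a in (0, 1) for b in (0, 1)):
        raise ValueError("Theorem 2.1 requires strictly positive transition probabilities")
    # Stationary distribution of a two-state chain.
    total = P[0][1] + P[1][0]
    pi = [P[1][0] / total, P[0][1] / total]

    def prob(z):
        return pi[z[0]] * P[z[0]][z[1]] * P[z[1]][z[2]]

    g = 0.0
    for z in product((0, 1), repeat=3):
        z_check = (z[0], 1 - z[1], z[2])  # flip the middle symbol z_{m+1}
        g += prob(z) * math.log(prob(z) / prob(z_check))
    return g
```

As a sanity check, an i.i.d. Bernoulli($q$) input (both rows equal to $(1-q, q)$) gives $g = (1 - 2q)\log\frac{1-q}{q}$, which is the derivative at $\varepsilon = 0$ of the binary entropy of $q + \varepsilon(1 - 2q)$; in particular $g = 0$ for the uniform i.i.d. input.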

In [18] a complete proof is given for first-order Markov chains, as well as 
the sketch for the generalization to higher order Markov chains. Alterna- 
tively, after appropriately enlarging the state space of X to convert the mth 
order Markov chain to a first order Markov chain, we can use Theorem 1.1 
of [13] to show $H(Z_\varepsilon)$ is analytic with respect to $\varepsilon$ at $\varepsilon = 0$, and Theorem
2.5 of [14] to show that all the derivatives of $H(Z_\varepsilon)$ at $\varepsilon = 0$ can be computed
explicitly (in principle) without taking limits. Theorem 2.1 does this
explicitly (in fact) for the first derivative. 

2.2. When transition probabilities of X are not necessarily all positive.
First consider the case when $X$ is a binary first order Markov chain with
the transition probability matrix

$$\begin{bmatrix} 1-p & p \\ 1 & 0 \end{bmatrix}, \tag{2.3}$$

where $0 < p < 1$. This process generates sequences satisfying the $(d,k) = (1,\infty)$-RLL
constraint, which simply means that the string 11 is forbidden.
Sequences generated by the output process $Z_\varepsilon$, however, will generally not
satisfy the constraint. The probability of the constraint-violating sequences
at the output of the channel is polynomial in $\varepsilon$, which will generally contribute
a term $O(\varepsilon \log \varepsilon)$ to the entropy rate $H(Z_\varepsilon)$ when $\varepsilon$ is small. This was
already observed for the probability transition matrix (2.3) in [23], where it
is shown that

$$H(Z_\varepsilon) = H(X) + \frac{p(2-p)}{1+p}\, \varepsilon \log 1/\varepsilon + O(\varepsilon) \tag{2.4}$$

as $\varepsilon \to 0$.
In the following we shall generalize formulas (2.1) and (2.4) and derive
a formula for entropy rate of any hidden Markov chain $Z_\varepsilon$, obtained when
passing a Markov chain $X$ of any order $m$ through a BSC($\varepsilon$). We will apply
the Birch bounds [5], for $n \ge m$, which yield

$$H(Z_0(\varepsilon) | Z_{-n+m}^{-1}(\varepsilon), X_{-n}^{-n+m-1}, E_{-n}^{-n+m-1}) \le H(Z_\varepsilon) \le H(Z_0(\varepsilon) | Z_{-n}^{-1}(\varepsilon)). \tag{2.5}$$

Note that the lower bound is really just

$$H(Z_0(\varepsilon) | Z_{-n+m}^{-1}(\varepsilon), X_{-n}^{-n+m-1}),$$

since $Z_{-n+m}^0(\varepsilon)$, if conditioned on $X_{-n}^{-n+m-1}$, is independent of $E_{-n}^{-n+m-1}$.
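For short blocks the Birch upper bound in (2.5) can be computed exactly, since $H(Z_0 | Z_{-n}^{-1}) = H(Z_{-n}^0) - H(Z_{-n}^{-1})$ and each output word probability follows from the standard forward recursion for hidden Markov chains. The sketch below assumes a first order binary input chain with transition matrix `Pi` and stationary vector `pi`; the function names are hypothetical, and the enumeration over all $2^{n+1}$ output words is only feasible for small $n$.

```python
import math
from itertools import product

def output_block_entropy(Pi, pi, eps, length):
    """Entropy (nats) of `length` consecutive outputs of BSC(eps) driven by the chain Pi."""
    if length == 0:
        return 0.0
    total = 0.0
    for z in product((0, 1), repeat=length):
        # Forward recursion: alpha[x] = P(outputs so far, current input = x).
        alpha = [pi[x] * ((1.0 - eps) if z[0] == x else eps) for x in (0, 1)]
        for t in range(1, length):
            alpha = [
                sum(alpha[x] * Pi[x][x2] for x in (0, 1))
                * ((1.0 - eps) if z[t] == x2 else eps)
                for x2 in (0, 1)
            ]
        pz = sum(alpha)
        if pz > 0.0:
            total -= pz * math.log(pz)
    return total

def birch_upper_bound(Pi, pi, eps, n):
    """H(Z_0 | Z_{-n}^{-1}) as a difference of block entropies (uses stationarity)."""
    return output_block_entropy(Pi, pi, eps, n + 1) - output_block_entropy(Pi, pi, eps, n)
```

For the chain (2.3) with $p = 1/2$, the value at $\varepsilon = 0$ equals $H(X) = (2/3)\log 2$ for every $n \ge 1$; for $\varepsilon > 0$ the bound can only be larger than $H(X)$, because $H(Z_\varepsilon) \ge H(Z_\varepsilon | E) = H(X)$.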

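Lemma 2.2. For a stationary input process $X_{-n}^0$ and the corresponding
output process $Z_{-n}^0(\varepsilon)$ through BSC($\varepsilon$) and $0 \le k \le n$,

$$H(Z_0(\varepsilon) | Z_{-n+k}^{-1}(\varepsilon), X_{-n}^{-n+k-1}) = H(X_0 | X_{-n}^{-1}) + f_n^k(X_{-n}^0)\, \varepsilon \log(1/\varepsilon) + g_n^k(X_{-n}^0)\, \varepsilon + O(\varepsilon^2 \log \varepsilon),$$

where $f_n^k(X_{-n}^0)$ and $g_n^k(X_{-n}^0)$ are given by (2.8) and (2.9) below, respectively.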

Proof. In this proof $w = w_{-n}^{-1}$, where $w_{-j}$ is a single binary bit, and we
let $v$ denote a single binary bit. And we use the notation for probability:

$$p_{XZ}(w) = P(X_{-n}^{-n+k-1} = w_{-n}^{-n+k-1},\ Z_{-n+k}^{-1}(\varepsilon) = w_{-n+k}^{-1}),$$

$$p_{XZ}(wv) = P(X_{-n}^{-n+k-1} = w_{-n}^{-n+k-1},\ Z_{-n+k}^{-1}(\varepsilon) = w_{-n+k}^{-1},\ Z_0(\varepsilon) = v)$$

and

$$p_{XZ}(v|w) = P(Z_0(\varepsilon) = v \mid Z_{-n+k}^{-1}(\varepsilon) = w_{-n+k}^{-1},\ X_{-n}^{-n+k-1} = w_{-n}^{-n+k-1}).$$

We remark that the definition of $p_{XZ}$ does depend on $\varepsilon$ and on how we partition
$w_{-n}^{-1}$ according to $k$; however, we keep the dependence implicit for notational
simplicity.

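We split $H(Z_0(\varepsilon) | Z_{-n+k}^{-1}(\varepsilon), X_{-n}^{-n+k-1})$ into five terms:

$$\begin{aligned}
H(Z_0(\varepsilon) | Z_{-n+k}^{-1}(\varepsilon), X_{-n}^{-n+k-1})
&= \sum_{wv \in \mathcal{A}(X)} -p_{XZ}(wv) \log(p_{XZ}(v|w)) \\
&\quad + \sum_{w \in \mathcal{A}(X),\ wv \notin \mathcal{A}(X)} -p_{XZ}(wv) \log(p_{XZ}(v|w)) \\
&\quad + \sum_{p_{XZ}(w) = \Theta(\varepsilon),\ p_{XZ}(wv) = \Theta(\varepsilon)} -p_{XZ}(wv) \log(p_{XZ}(v|w)) \\
&\quad + \sum_{p_{XZ}(w) = \Theta(\varepsilon),\ p_{XZ}(wv) = O(\varepsilon^2)} -p_{XZ}(wv) \log(p_{XZ}(v|w)) \\
&\quad + \sum_{p_{XZ}(w) = O(\varepsilon^2)} -p_{XZ}(wv) \log(p_{XZ}(v|w));
\end{aligned} \tag{2.6}$$

here by $\alpha = \Theta(\beta)$, we mean, as usual, there exist positive constants $C_1, C_2$
such that $C_1|\beta| \le |\alpha| \le C_2|\beta|$, while by $\alpha = O(\beta)$, we mean there exists a
positive constant $C$ such that $|\alpha| \le C|\beta|$; note that from

$$p_{XZ}(w) = \sum_{u_{-n+k}^{-1}\,:\ w_{-n}^{-n+k-1} u_{-n+k}^{-1} \in \mathcal{A}(X_{-n}^{-1})} P(X_{-n}^{-n+k-1} = w_{-n}^{-n+k-1},\ X_{-n+k}^{-1} = u_{-n+k}^{-1}) \prod_{j=-n+k}^{-1} p_E(w_j \oplus u_j)$$

we see that $p_{XZ}(w) = \Theta(\varepsilon)$ is equivalent to the statement that $w \notin \mathcal{A}(X_{-n}^{-1})$,
and by flipping exactly one of the bits in $w_{-n+k}^{-1}$, one obtains, from $w$, a
sequence in $\mathcal{A}(X_{-n}^{-1})$.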

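For the fourth term, we have

$$\sum_{p_{XZ}(w) = \Theta(\varepsilon),\ p_{XZ}(wv) = O(\varepsilon^2)} -p_{XZ}(wv) \log(p_{XZ}(v|w)) = O(\varepsilon^2 \log \varepsilon).$$

For the fifth term, we have

$$\sum_{p_{XZ}(w) = O(\varepsilon^2)} -p_{XZ}(wv) \log(p_{XZ}(v|w)) = \sum_{p_{XZ}(w) = O(\varepsilon^2)} -p_{XZ}(w) \sum_v p_{XZ}(v|w) \log(p_{XZ}(v|w)) \le (\log 2) \sum_{p_{XZ}(w) = O(\varepsilon^2)} p_{XZ}(w) = O(\varepsilon^2),$$

where we use the fact that $-\sum_v p_{XZ}(v|w) \log(p_{XZ}(v|w)) \le \log 2$ for any $w$.
We conclude that the sum of the fourth term and the fifth term is $O(\varepsilon^2 \log \varepsilon)$.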
For a binary sequence $u_{-n}^{-1}$, define $h_n^k(u_{-n}^{-1})$ to be

$$h_n^k(u_{-n}^{-1}) = \sum_{j=1}^{n-k} p_X(u_{-n}^{-j-1}\, \bar{u}_{-j}\, u_{-j+1}^{-1}) - (n-k)\, p_X(u_{-n}^{-1}). \tag{2.7}$$

Note that with this notation, $h_n^k(w)$ and $h_{n+1}^k(wv)$ can be expressed as
derivatives with respect to $\varepsilon$ at $\varepsilon = 0$:

$$h_n^k(w) = p'_{XZ}(w)\big|_{\varepsilon=0}, \qquad h_{n+1}^k(wv) = p'_{XZ}(wv)\big|_{\varepsilon=0}.$$

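Then for the first term, we have

$$\begin{aligned}
\sum_{wv \in \mathcal{A}(X)} -p_{XZ}(wv) \log(p_{XZ}(v|w))
&= -\sum_{wv \in \mathcal{A}(X)} \left( p_X(wv) + h_{n+1}^k(wv)\varepsilon + O(\varepsilon^2) \right) \\
&\qquad \times \log\left( p_X(v|w) + \frac{h_{n+1}^k(wv) p_X(w) - h_n^k(w) p_X(wv)}{p_X^2(w)}\, \varepsilon + O(\varepsilon^2) \right) \\
&= H(X_0 | X_{-n}^{-1}) - \sum_{wv \in \mathcal{A}(X)} \left( h_{n+1}^k(wv) \log p_X(v|w) + \frac{h_{n+1}^k(wv) p_X(w) - h_n^k(w) p_X(wv)}{p_X(w)} \right) \varepsilon + O(\varepsilon^2).
\end{aligned}$$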

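For the second term, it is easy to check that for $w \in \mathcal{A}(X)$ and $wv \notin \mathcal{A}(X)$,
$p_{XZ}(v|w) = \Theta(\varepsilon)$ and so

$$p_{XZ}(wv) = h_{n+1}^k(wv)\varepsilon + O(\varepsilon^2);$$

we then obtain

$$\begin{aligned}
\sum_{w \in \mathcal{A}(X),\ wv \notin \mathcal{A}(X)} -p_{XZ}(wv) \log(p_{XZ}(v|w))
&= -\sum_{w \in \mathcal{A}(X),\ wv \notin \mathcal{A}(X)} h_{n+1}^k(wv)\, \varepsilon \log \frac{h_{n+1}^k(wv)\varepsilon + O(\varepsilon^2)}{p_{XZ}(w)} + O(\varepsilon^2) \log \Theta(\varepsilon) \\
&= \sum_{w \in \mathcal{A}(X),\ wv \notin \mathcal{A}(X)} h_{n+1}^k(wv)\, \varepsilon \log(1/\varepsilon) - \left( \sum_{w \in \mathcal{A}(X),\ wv \notin \mathcal{A}(X)} h_{n+1}^k(wv) \log \frac{h_{n+1}^k(wv)}{p_X(w)} \right) \varepsilon + O(\varepsilon^2 \log \varepsilon).
\end{aligned}$$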

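For the third term, we have

$$\begin{aligned}
\sum_{p_{XZ}(w) = \Theta(\varepsilon),\ p_{XZ}(wv) = \Theta(\varepsilon)} -p_{XZ}(wv) \log(p_{XZ}(v|w))
&= -\sum_{p_{XZ}(w) = \Theta(\varepsilon),\ p_{XZ}(wv) = \Theta(\varepsilon)} \left( h_{n+1}^k(wv)\varepsilon + O(\varepsilon^2) \right) \log \frac{h_{n+1}^k(wv)\varepsilon + O(\varepsilon^2)}{h_n^k(w)\varepsilon + O(\varepsilon^2)} \\
&= -\left( \sum_{p_{XZ}(w) = \Theta(\varepsilon),\ p_{XZ}(wv) = \Theta(\varepsilon)} h_{n+1}^k(wv) \log \frac{h_{n+1}^k(wv)}{h_n^k(w)} \right) \varepsilon + O(\varepsilon^2).
\end{aligned}$$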

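In summary, $H(Z_0(\varepsilon) | Z_{-n+k}^{-1}(\varepsilon), X_{-n}^{-n+k-1})$ can be rewritten as

$$H(Z_0(\varepsilon) | Z_{-n+k}^{-1}(\varepsilon), X_{-n}^{-n+k-1}) = H(X_0 | X_{-n}^{-1}) + f_n^k(X_{-n}^0)\, \varepsilon \log(1/\varepsilon) + g_n^k(X_{-n}^0)\, \varepsilon + O(\varepsilon^2 \log \varepsilon),$$

where [see (2.7) for the definition of $h_n^k(\cdot)$]

$$f_n^k(X_{-n}^0) = \sum_{w \in \mathcal{A}(X),\ wv \notin \mathcal{A}(X)} h_{n+1}^k(wv) = \sum_{w \in \mathcal{A}(X),\ wv \notin \mathcal{A}(X)} \left( \sum_{j=1}^{n-k} p_X(w_{-n}^{-j-1}\, \bar{w}_{-j}\, w_{-j+1}^{-1} v) + p_X(w_{-n}^{-1} \bar{v}) \right) \tag{2.8}$$

and

$$\begin{aligned}
g_n^k(X_{-n}^0) &= -\sum_{wv \in \mathcal{A}(X)} \left( h_{n+1}^k(wv) \log p_X(v|w) + \frac{h_{n+1}^k(wv) p_X(w) - h_n^k(w) p_X(wv)}{p_X(w)} \right) \\
&\quad - \sum_{w \in \mathcal{A}(X),\ wv \notin \mathcal{A}(X)} h_{n+1}^k(wv) \log \frac{h_{n+1}^k(wv)}{p_X(w)} \\
&\quad - \sum_{p_{XZ}(w) = \Theta(\varepsilon),\ p_{XZ}(wv) = \Theta(\varepsilon)} h_{n+1}^k(wv) \log \frac{h_{n+1}^k(wv)}{h_n^k(w)}.
\end{aligned} \tag{2.9}$$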
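The coefficient in (2.8) can be checked by brute force on small examples. The sketch below assumes a first order binary input chain given by a transition matrix `Pi` and stationary vector `pi` (so it does not cover higher order chains), and the names `word_prob`, `h_coeff` and `f_coefficient` are hypothetical. It sums, over all $w$ of positive probability whose extension $wv$ has probability zero, the first-order coefficient of $p_{XZ}(wv)$, obtained by flipping one of the $n - k + 1$ symbols that pass through the channel.

```python
from itertools import product

def word_prob(word, Pi, pi):
    """Stationary probability of a word under a first order chain."""
    p = pi[word[0]]
    for a, b in zip(word, word[1:]):
        p *= Pi[a][b]
    return p

def h_coeff(word, num_noisy, Pi, pi):
    """Analogue of (2.7): derivative at eps = 0 when the last num_noisy symbols are noisy."""
    L = len(word)
    total = 0.0
    for pos in range(L - num_noisy, L):
        flipped = word[:pos] + (1 - word[pos],) + word[pos + 1:]
        total += word_prob(flipped, Pi, pi)
    return total - num_noisy * word_prob(word, Pi, pi)

def f_coefficient(Pi, pi, n, k):
    """Brute-force evaluation of (2.8) for words w of length n and a single bit v."""
    total = 0.0
    for w in product((0, 1), repeat=n):
        if word_prob(w, Pi, pi) == 0.0:
            continue  # w is not an allowed word
        for v in (0, 1):
            wv = w + (v,)
            if word_prob(wv, Pi, pi) == 0.0:  # wv is a forbidden extension
                total += h_coeff(wv, n - k + 1, Pi, pi)
    return total
```

For the chain (2.3) the result should agree with the coefficient $p(2-p)/(1+p)$ in (2.4), and it should not change when $n$ grows beyond $2m = 2$ or when $k$ is switched between $0$ and $m = 1$.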



Remark 2.3. For any $\delta > 0$ and fixed $n$, the constant in $O(\varepsilon^2 \log \varepsilon)$ in
Lemma 2.2 can be chosen uniformly on $\mathcal{P}_{n+1,\delta}$, where $\mathcal{P}_{n+1,\delta}$ denotes the set
of binary stationary processes $X = X_{-n}^0$, such that, for all $w_{-n}^0 \in \mathcal{A}(X)$, we
have $p_X(w) \ge \delta$.

Theorem 2.4. For an $m$th order Markov chain $X$ passing through a
BSC($\varepsilon$), with $Z_\varepsilon$ as the output hidden Markov chain,

$$H(Z_\varepsilon) = H(X) + f(X)\, \varepsilon \log(1/\varepsilon) + g(X)\, \varepsilon + O(\varepsilon^2 \log \varepsilon),$$

where $f(X) = f_{2m}^0(X_{-2m}^0) = f_{2m}^m(X_{-2m}^0)$ and $g(X) = g_{3m}^0(X_{-3m}^0) = g_{3m}^m(X_{-3m}^0)$.
Proof. We apply Lemma 2.2 to the Birch upper and lower bounds
[equation (2.5)] of $H(Z_\varepsilon)$. For the upper bound, $k = 0$, we have, for all $n$,

$$H(Z_0(\varepsilon) | Z_{-n}^{-1}(\varepsilon)) = H(X_0 | X_{-n}^{-1}) + f_n^0(X_{-n}^0)\, \varepsilon \log(1/\varepsilon) + g_n^0(X_{-n}^0)\, \varepsilon + O(\varepsilon^2 \log \varepsilon).$$

And for the lower bound, $k = m$, we have, for $n \ge m$,

$$H(Z_0(\varepsilon) | Z_{-n+m}^{-1}(\varepsilon), X_{-n}^{-n+m-1}) = H(X_0 | X_{-n}^{-1}) + f_n^m(X_{-n}^0)\, \varepsilon \log(1/\varepsilon) + g_n^m(X_{-n}^0)\, \varepsilon + O(\varepsilon^2 \log \varepsilon).$$

The first term always coincides for the upper and lower bounds. When
$n \ge m$, since $X$ is an $m$th order Markov chain,

$$H(X_0 | X_{-n}^{-1}) = H(X_0 | X_{-m}^{-1}) = H(X).$$

Let $w = w_{-n}^{-1}$, where $w_{-j}$ is a single bit, and $v$ denotes a single bit. If
$w \in \mathcal{A}(X)$ and $wv \notin \mathcal{A}(X)$, then $p_X(w_{-m}^{-1} v) = 0$. It then follows that for an
$m$th order Markov chain, when $n \ge 2m$,

$$f_n^m(X_{-n}^0) = f_n^0(X_{-n}^0) = f_{2m}^0(X_{-2m}^0) = f_{2m}^m(X_{-2m}^0). \tag{2.10}$$

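Now consider $g_n^k(X_{-n}^0)$. When $0 \le k \le m$, we have the following facts [for
a detailed derivation of (2.11)-(2.13), see the Appendix]:

$$\text{if } wv \in \mathcal{A}(X), \quad p_X(v|w) = p_X(v | w_{-m}^{-1}) \quad \text{for } n \ge m, \tag{2.11}$$

$$\text{if } w \in \mathcal{A}(X),\ wv \notin \mathcal{A}(X), \quad \frac{h_{n+1}^k(wv)}{p_X(w)} \text{ is constant (as function of } n \text{ and } k\text{) for } n \ge 2m,\ 0 \le k \le m, \tag{2.12}$$

$$\text{if } p_{XZ}(w) = \Theta(\varepsilon),\ p_{XZ}(wv) = \Theta(\varepsilon), \quad \frac{h_{n+1}^k(wv)}{h_n^k(w)} \text{ is constant for } n \ge 3m,\ 0 \le k \le m. \tag{2.13}$$

It then follows [see the derivations of (2.14)-(2.16) in the Appendix] that

$$\sum_{wv \in \mathcal{A}(X)} \frac{h_{n+1}^k(wv) p_X(w) - h_n^k(w) p_X(wv)}{p_X(w)} \tag{2.14}$$

is constant (as a function of $n$) for $n \ge 2m$, $0 \le k \le m$,

$$\sum_{w \in \mathcal{A}(X),\ wv \notin \mathcal{A}(X)} h_{n+1}^k(wv) \log \frac{h_{n+1}^k(wv)}{p_X(w)} \tag{2.15}$$

is constant for $n \ge 2m$, $0 \le k \le m$, and

$$\sum_{wv \in \mathcal{A}(X)} h_{n+1}^k(wv) \log p_X(v|w) + \sum_{p_{XZ}(w) = \Theta(\varepsilon),\ p_{XZ}(wv) = \Theta(\varepsilon)} h_{n+1}^k(wv) \log \frac{h_{n+1}^k(wv)}{h_n^k(w)} \tag{2.16}$$

is constant for $n \ge 3m$, $0 \le k \le m$.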
Consequently, we have

$$g_n^m(X_{-n}^0) = g_n^0(X_{-n}^0) = g_{3m}^0(X_{-3m}^0) = g_{3m}^m(X_{-3m}^0). \tag{2.17}$$

Let $f(X) = f_{2m}^0(X_{-2m}^0)$ and $g(X) = g_{3m}^0(X_{-3m}^0)$; then the theorem follows. □

Remark 2.5. Note that this result applies in particular to the case
when the transition probabilities of $X$ are all positive; thus, in this case the
formula should reduce to that of Theorem 2.1. Indeed, when all transition
probabilities of $X$ are positive, $f(X)$ vanishes since the summation in (2.8)
is taken over an empty set; on the other hand, again from (2.8), if some
of the transition probabilities of $X$ are zero, then $f(X)$ does not vanish
[to see this, note that when $w \in \mathcal{A}(X)$, $wv \notin \mathcal{A}(X)$, necessarily we will have
$w\bar{v} \in \mathcal{A}(X)$]. The agreement of $g(X)$ with the expression in Theorem 2.1 is a
straightforward, but tedious, computation.

Remark 2.6. Together with Remark 2.3, the proof of Theorem 2.4 implies
that for any $\delta > 0$ and fixed $m$, the constant in $O(\varepsilon^2 \log \varepsilon)$ in Theorem 2.4
can be chosen uniformly on $\mathcal{Q}_{m,\delta}$, where $\mathcal{Q}_{m,\delta}$ denotes the set of all
$m$th order Markov chains $X$ such that, whenever $w = w_{-m}^0 \in \mathcal{A}(X)$, we have
$p_X(w) \ge \delta$.

Remark 2.7. The error term in the formula of Theorem 2.4 cannot be
improved, in the sense that, in some cases, the error term is dominated by
a strictly positive constant times $\varepsilon^2 \log \varepsilon$.

As we showed in Theorem 2.4, the Birch upper bound with $n = 3m$ yields

$$H(Z_0(\varepsilon) | Z_{-3m}^{-1}(\varepsilon)) = H(X) + f(X)\, \varepsilon \log(1/\varepsilon) + g(X)\, \varepsilon + O(\varepsilon^2 \log \varepsilon).$$

Together with (2.6), one checks that the $\Theta(\varepsilon^2 \log \varepsilon)$ term in the error term
$O(\varepsilon^2 \log \varepsilon)$ is contributed by [see the second term in (2.6) with $k = 0$]

$$\sum_{w \in \mathcal{A}(X),\ wv \notin \mathcal{A}(X)} -p_Z(wv) \log(p_Z(v|w))$$

and [see the fourth term in (2.6) with $k = 0$]

$$\sum_{p_Z(w) = \Theta(\varepsilon),\ p_Z(wv) = O(\varepsilon^2)} -p_Z(wv) \log(p_Z(v|w)),$$

and this $\Theta(\varepsilon^2 \log \varepsilon)$ term does not vanish at least for certain cases. For
instance, consider the input Markov chain $X$ with the following transition
probability matrix:

$$\begin{bmatrix} 1-p & p \\ 1 & 0 \end{bmatrix},$$

where $0 < p < 1$. Then one checks that for this case, $m = 1$, $n = 3$, and the
coefficient of the above-mentioned $\Theta(\varepsilon^2 \log \varepsilon)$ term takes the form of

1 — 6p + 7p 2 — 

T+p ' 

which is strictly positive for $p$ close to 0.

3. Asymptotics of capacity. Consider a binary irreducible finite type constraint
$\mathcal{S}$ defined by $\mathcal{F}$, which consists of forbidden words with length $\tilde{m} + 1$.
In general, there are many such $\mathcal{F}$'s corresponding to the same $\mathcal{S}$ with different
lengths; here we may choose $\mathcal{F}$ to be the one with the smallest length
$\tilde{m} + 1$. And $\tilde{m} = \tilde{m}(\mathcal{S})$ is defined to be the topological order of the constraint
$\mathcal{S}$. For example, the order of $\mathcal{S}(d,k)$, discussed in the introduction, is $k$ [20].
The topological order of a finite type constraint is analogous to the order of
a Markov chain.

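Recall from (1.5) that for an input-constrained BSC($\varepsilon$) with input sequences
$X_{-n}^0$ in $\mathcal{S}$ and with the corresponding output $Z_{-n}^0(\varepsilon)$, the capacity
can be written as

$$C(\mathcal{S}, \varepsilon) = \lim_{n \to \infty} \sup_{X_{-n}^0 \in \mathcal{P}_{n+1},\ \mathcal{A}(X_{-n}^0) \subseteq \mathcal{S}} \frac{1}{n+1} \left( H(Z_{-n}^0(\varepsilon)) - H(Z_{-n}^0(\varepsilon) | X_{-n}^0) \right).$$

Since the noise distribution is symmetric and the noise process $E$ is i.i.d.
and independent of $X$, this can be simplified to

$$C(\mathcal{S}, \varepsilon) = \lim_{n \to \infty} \sup_{X_{-n}^0 \in \mathcal{P}_{n+1},\ \mathcal{A}(X_{-n}^0) \subseteq \mathcal{S}} H(Z_{-n}^0(\varepsilon))/(n+1) - H(\varepsilon),$$

which can be rewritten as

$$C(\mathcal{S}, \varepsilon) = \lim_{n \to \infty} \sup_{X_{-n}^0 \in \mathcal{P}_{n+1},\ \mathcal{A}(X_{-n}^0) \subseteq \mathcal{S}} H(Z_0(\varepsilon) | Z_{-n}^{-1}(\varepsilon)) - H(\varepsilon),$$

where we used the chain rule for entropy (see page 21 of [8])

$$H(Z_{-n}^0(\varepsilon)) = \sum_{j=0}^{n} H(Z_0(\varepsilon) | Z_{-j}^{-1}(\varepsilon)),$$

and the fact that (further) conditioning reduces entropy (see page 27 of [8])

$$H(Z_0(\varepsilon) | Z_{-j_1}^{-1}(\varepsilon)) \ge H(Z_0(\varepsilon) | Z_{-j_2}^{-1}(\varepsilon)) \quad \text{for } j_1 \le j_2.$$

Recall from (1.5) that

$$C(\mathcal{S}, \varepsilon) = \sup_{X \in \mathcal{P},\ \mathcal{A}(X) \subseteq \mathcal{S}} H(Z_\varepsilon) - H(Z_\varepsilon | X).$$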

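Now let

$$\mathcal{H}_n(\mathcal{S}, \varepsilon) = \sup_{X_{-n}^0 \in \mathcal{P}_{n+1},\ \mathcal{A}(X_{-n}^0) \subseteq \mathcal{S}} H(Z_0(\varepsilon) | Z_{-n}^{-1}(\varepsilon))$$

and

$$h_m(\mathcal{S}, \varepsilon) = \sup_{X \in \mathcal{M}_m,\ \mathcal{A}(X) \subseteq \mathcal{S}} H(Z_\varepsilon),$$

where $\mathcal{M}_m$ denotes the set of all $m$th order binary irreducible Markov chains;
we then have the bounds for $C(\mathcal{S}, \varepsilon)$:

$$h_m(\mathcal{S}, \varepsilon) - H(\varepsilon) \le C(\mathcal{S}, \varepsilon) \le \mathcal{H}_n(\mathcal{S}, \varepsilon) - H(\varepsilon). \tag{3.1}$$

Noting that

$$\sup_{X_{-n}^0 \in \mathcal{P}_{n+1},\ \mathcal{A}(X_{-n}^0) \subsetneq \mathcal{S}_{n+1}} H(X_0 | X_{-n}^{-1}) < \sup_{X_{-n}^0 \in \mathcal{P}_{n+1},\ \mathcal{A}(X_{-n}^0) = \mathcal{S}_{n+1}} H(X_0 | X_{-n}^{-1})$$

(here $\subsetneq$ means "proper subset of"), and $H(Z_0(\varepsilon) | Z_{-n}^{-1}(\varepsilon))$ are continuous at
$\varepsilon = 0$, we conclude that, for $\varepsilon$ sufficiently small ($\varepsilon < \varepsilon_0$), one may choose
$\delta > 0$ (here, $\delta$ depends on $n$ and $m$) such that

$$\mathcal{H}_n(\mathcal{S}, \varepsilon) = \sup_{X_{-n}^0 \in \mathcal{P}_{n+1,\delta},\ \mathcal{A}(X_{-n}^0) = \mathcal{S}_{n+1}} H(Z_0(\varepsilon) | Z_{-n}^{-1}(\varepsilon)).$$

So from now on we only consider stationary processes $X = X_{-n}^0$ with $\mathcal{A}(X_{-n}^0) = \mathcal{S}_{n+1}$.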

Now for a stationary process $X = X_{-n}^0$, define $\vec{p}_n$ as the following probability
vector indexed by all the elements in $\mathcal{S}_{n+1}$:

$$\vec{p}_n = \vec{p}_n(X_{-n}^0) = (P(X_{-n}^0 = w_{-n}^0) : w_{-n}^0 \in \mathcal{S}_{n+1}).$$

To emphasize the dependence of $X_{-n}^0$ on $\vec{p}_n$, in the following, we shall rewrite
$X_{-n}^0$ as $X_{-n}^0(\vec{p}_n)$. For an $m$th order binary irreducible Markov chain $X = X_{-\infty}^0$,
slightly abusing the notation, define $\vec{p}_m$ as the following probability
vector indexed by all the elements in $\mathcal{S}_{m+1}$,

$$\vec{p}_m = \vec{p}_m(X_{-\infty}^0) = (P(X_{-m}^0 = w_{-m}^0) : w_{-m}^0 \in \mathcal{S}_{m+1}).$$

Similarly, to emphasize the dependence of $X = X_{-\infty}^0$ on $\vec{p}_m$, in the following,
we shall rewrite $X$ as $X_{\vec{p}_m}$. And we shall use $Z_{-n}^0(\vec{p}_n, \varepsilon)$ to denote the output
process obtained by passing $X_{-n}^0(\vec{p}_n)$ through BSC($\varepsilon$), and use $Z_{\vec{p}_m, \varepsilon}$ to
denote the output process obtained by passing $X_{\vec{p}_m}$ through BSC($\varepsilon$).
Lemma 3.1. For any stationary process $X_{-n}^0(\vec{p}_n)$ with $\mathcal{A}(X_{-n}^0(\vec{p}_n)) = \mathcal{S}_{n+1}$,
$H(X_0(\vec{p}_n) | X_{-n}^{-1}(\vec{p}_n))$, as a function of $\vec{p}_n$, has a negative definite Hessian
matrix.

Proof. Note that

$$H(X_0(\vec{p}_n) | X_{-n}^{-1}(\vec{p}_n)) = -\sum_{x_{-n}^0 \in \mathcal{S}_{n+1}} p(x_{-n}^0) \log p(x_0 | x_{-n}^{-1}).$$

For two different probability vectors $\vec{p}_n$ and $\vec{q}_n$, consider the convex combination

$$\vec{r}_n(t) = t\vec{p}_n + (1 - t)\vec{q}_n,$$

where $0 \le t \le 1$. It suffices to prove that $H(X_0(\vec{r}_n(t)) | X_{-n}^{-1}(\vec{r}_n(t)))$ has a
strictly negative second derivative with respect to $t$. Now consider a single
term in $H(X_0(\vec{r}_n(t)) | X_{-n}^{-1}(\vec{r}_n(t)))$:

$$-\left( t p_n(x_{-n}^0) + (1-t) q_n(x_{-n}^0) \right) \log \frac{t p_n(x_{-n}^0) + (1-t) q_n(x_{-n}^0)}{t p_n(x_{-n}^{-1}) + (1-t) q_n(x_{-n}^{-1})}.$$

Note that for two formal symbols $\alpha$ and $\beta$, if we assume $\alpha'' = 0$ and $\beta'' = 0$,
the second order formal derivative of $\alpha \log \frac{\alpha}{\beta}$ can be computed as

$$\left( \alpha \log \frac{\alpha}{\beta} \right)'' = \left( \frac{\alpha'}{\sqrt{\alpha}} - \frac{\sqrt{\alpha}\, \beta'}{\beta} \right)^2.$$

It then follows that the second derivative of this term (with respect to $t$)
can be calculated as

$$-\left( \frac{p_n(x_{-n}^0) - q_n(x_{-n}^0)}{\sqrt{t p_n(x_{-n}^0) + (1-t) q_n(x_{-n}^0)}} - \sqrt{t p_n(x_{-n}^0) + (1-t) q_n(x_{-n}^0)}\ \frac{p_n(x_{-n}^{-1}) - q_n(x_{-n}^{-1})}{t p_n(x_{-n}^{-1}) + (1-t) q_n(x_{-n}^{-1})} \right)^2.$$

That is, the expression above is always nonpositive, and is equal to 0 only if

$$\frac{p_n(x_{-n}^0) - q_n(x_{-n}^0)}{t p_n(x_{-n}^0) + (1-t) q_n(x_{-n}^0)} = \frac{p_n(x_{-n}^{-1}) - q_n(x_{-n}^{-1})}{t p_n(x_{-n}^{-1}) + (1-t) q_n(x_{-n}^{-1})},$$

which is equivalent to

$$P(X_0(\vec{p}_n) = x_0 | X_{-n}^{-1}(\vec{p}_n) = x_{-n}^{-1}) = P(X_0(\vec{q}_n) = x_0 | X_{-n}^{-1}(\vec{q}_n) = x_{-n}^{-1}). \tag{3.2}$$

Since $\mathcal{S}$ is an irreducible finite type constraint and $\mathcal{A}(X_{-n}^0(\vec{p}_n)) = \mathcal{A}(X_{-n}^0(\vec{q}_n)) = \mathcal{S}_{n+1}$,
the expression (3.2) cannot be true for every $x_{-n}^0$ unless $\vec{p}_n = \vec{q}_n$. So
we conclude that the second derivative of $H(X_0(\vec{r}_n(t)) | X_{-n}^{-1}(\vec{r}_n(t)))$ (with
respect to $t$) is strictly negative. Thus, $H(X_0(\vec{p}_n) | X_{-n}^{-1}(\vec{p}_n))$, as a function
of $\vec{p}_n$, has a strictly negative definite Hessian. □

For $m \ge \tilde{m}$, over all $m$th order Markov chains $X_{\vec{p}_m}$ with $\mathcal{A}(X_{\vec{p}_m}) = \mathcal{S}$,
$H(X_{\vec{p}_m})$ is maximized at some unique Markov chain $X_{\vec{p}_m^{\max}}$ (see [20, 24]).
Moreover, $X_{\vec{p}_m^{\max}}$ does not depend on $m$ and is an $\tilde{m}$th order Markov chain,
so we will drop the subscript $m$ and instead use $X_{\vec{p}_{\max}}$ to denote $X_{\vec{p}_m^{\max}}$
for any $m \ge \tilde{m}$. The same idea shows that over all stationary distributions
$X_{-n}^0(\vec{p}_n)$ ($n \ge \tilde{m}$) with $\mathcal{A}(X_{-n}^0(\vec{p}_n)) = \mathcal{S}_{n+1}$, $H(X_0(\vec{p}_n) | X_{-n}^{-1}(\vec{p}_n))$ is maximized
at $\vec{p}_n^{\max}$, which corresponds to the above unique $X_{\vec{p}_{\max}}$ as well.

Note that $C(\mathcal{S}) = C(\mathcal{S}, 0)$ is equal to the noiseless capacity of the constraint
$\mathcal{S}$. This quantity has been extensively studied, and several interpretations
and methods for its explicit derivation are known (see, e.g., [21] and
the extensive bibliography therein). It is well known that $C(\mathcal{S}) = H(X_{\vec{p}_{\max}})$
(see [20, 24]).
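For the $(d,k)$-RLL constraints, one common way to obtain the noiseless capacity is as the logarithm of the largest eigenvalue of the adjacency matrix of a graph presentation. The sketch below is an assumption-laden illustration: it uses the run-length presentation in which state $i$ records the number of zeros since the last 1 (with `k=None` standing for $k = \infty$ and a self-loop at state $d$ in that case), it estimates the eigenvalue by power iteration on $A + I$ rather than by an exact eigen-solver, and the function names are hypothetical.

```python
import math

def rll_adjacency(d, k=None):
    """Adjacency matrix of a run-length graph presentation of the (d, k)-RLL constraint."""
    n = (d if k is None else k) + 1
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        if i < n - 1:
            A[i][i + 1] = 1          # emit 0: one more zero since the last 1
        elif k is None:
            A[i][i] = 1              # k = infinity: any further zeros are allowed
        if i >= d:
            A[i][0] = 1              # emit 1: allowed once at least d zeros were seen
    return A

def largest_eigenvalue(A, iters=5000):
    """Perron eigenvalue via power iteration on A + I (the shift avoids periodicity)."""
    n = len(A)
    v = [1.0] * n
    mu = 1.0
    for _ in range(iters):
        w = [v[i] + sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        mu = max(w)
        v = [x / mu for x in w]
    return mu - 1.0

def noiseless_capacity(d, k=None):
    """log (natural) of the largest eigenvalue, in nats per symbol."""
    return math.log(largest_eigenvalue(rll_adjacency(d, k)))
```

For the $(1,\infty)$ constraint the largest eigenvalue is the golden ratio, so the capacity is $\log\frac{1+\sqrt{5}}{2} \approx 0.4812$ nats; the $(0,1)$ constraint gives the same value by symmetry between the roles of 0 and 1.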

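Theorem 3.2. 1. If $n \ge 3\tilde{m}(\mathcal{S})$,

$$\mathcal{H}_n(\mathcal{S}, \varepsilon) = C(\mathcal{S}) + f(X_{\vec{p}_{\max}})\, \varepsilon \log(1/\varepsilon) + g(X_{\vec{p}_{\max}})\, \varepsilon + O(\varepsilon^2 \log^2 \varepsilon).$$

2. If $m \ge \tilde{m}(\mathcal{S})$,

$$h_m(\mathcal{S}, \varepsilon) = C(\mathcal{S}) + f(X_{\vec{p}_{\max}})\, \varepsilon \log(1/\varepsilon) + g(X_{\vec{p}_{\max}})\, \varepsilon + O(\varepsilon^2 \log^2 \varepsilon).$$

Here, as defined in Theorem 2.4, $f(X_{\vec{p}_{\max}}) = f_{2\tilde{m}}^0(X_{-2\tilde{m}}^0(\vec{p}_{2\tilde{m}}^{\max}))$ and $g(X_{\vec{p}_{\max}}) = g_{3\tilde{m}}^0(X_{-3\tilde{m}}^0(\vec{p}_{3\tilde{m}}^{\max}))$.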
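Proof. We first prove the statement for $\mathcal{H}_n(\mathcal{S}, \varepsilon)$. As mentioned before,
for $\varepsilon$ sufficiently small ($\varepsilon < \varepsilon_0$), $\mathcal{H}_n(\mathcal{S}, \varepsilon)$ is achieved by $X_{-n}^0$ with $\mathcal{A}(X_{-n}^0) = \mathcal{S}_{n+1}$;
and one may choose $\delta$ such that

$$\mathcal{H}_n(\mathcal{S}, \varepsilon) = \sup_{X_{-n}^0(\vec{p}_n) \in \mathcal{P}_{n+1,\delta},\ \mathcal{A}(X_{-n}^0(\vec{p}_n)) = \mathcal{S}_{n+1}} H(Z_0(\vec{p}_n, \varepsilon) | Z_{-n}^{-1}(\vec{p}_n, \varepsilon)).$$

Below, we assume $\varepsilon < \varepsilon_0$, $X_{-n}^0(\vec{p}_n) \in \mathcal{P}_{n+1,\delta}$, $\mathcal{A}(X_{-n}^0(\vec{p}_n)) = \mathcal{S}_{n+1}$, and for
notational convenience, we rewrite $f_n^0(X_{-n}^0(\vec{p}_n))$ as $f_n(\vec{p}_n)$, $g_n^0(X_{-n}^0(\vec{p}_n))$ as
$g_n(\vec{p}_n)$.

In Lemma 2.2 we have proved that

$$H(Z_0(\vec{p}_n, \varepsilon) | Z_{-n}^{-1}(\vec{p}_n, \varepsilon)) = H(X_0(\vec{p}_n) | X_{-n}^{-1}(\vec{p}_n)) + f_n(\vec{p}_n)\, \varepsilon \log(1/\varepsilon) + g_n(\vec{p}_n)\, \varepsilon + O(\varepsilon^2 \log \varepsilon).$$

Moreover, by Remark 2.3, for any $\delta > 0$, $O(\varepsilon^2 \log \varepsilon)$ is uniform on $\mathcal{P}_{n+1,\delta}$,
that is, there is a constant $C$ (depending on $n$) such that, for all $X_{-n}^0(\vec{p}_n)$ with
$X_{-n}^0(\vec{p}_n) \in \mathcal{P}_{n+1,\delta}$ and $\mathcal{A}(X_{-n}^0(\vec{p}_n)) = \mathcal{S}_{n+1}$,

$$\left| H(Z_0(\vec{p}_n, \varepsilon) | Z_{-n}^{-1}(\vec{p}_n, \varepsilon)) - H(X_0(\vec{p}_n) | X_{-n}^{-1}(\vec{p}_n)) - f_n(\vec{p}_n)\, \varepsilon \log(1/\varepsilon) - g_n(\vec{p}_n)\, \varepsilon \right| \le C \varepsilon^2 \log(1/\varepsilon).$$

Let $\vec{q}_n = \vec{p}_n - \vec{p}_n^{\max}$. Since $H(X_0(\vec{p}_n) | X_{-n}^{-1}(\vec{p}_n))$ is maximized at $\vec{p}_n^{\max}$, we
can expand $H(X_0(\vec{p}_n) | X_{-n}^{-1}(\vec{p}_n))$ around $\vec{p}_n^{\max}$:

$$\begin{aligned}
H(X_0(\vec{p}_n) | X_{-n}^{-1}(\vec{p}_n)) &= H(X_0(\vec{p}_n^{\max}) | X_{-n}^{-1}(\vec{p}_n^{\max})) + \vec{q}_n^{\,t} K_1 \vec{q}_n + O(|\vec{q}_n|^3) \\
&= H(X_{\vec{p}_{\max}}) + \vec{q}_n^{\,t} K_1 \vec{q}_n + O(|\vec{q}_n|^3),
\end{aligned}$$

where $K_1$ is a negative definite matrix by Lemma 3.1 (the second equality
follows from the fact that $X_{\vec{p}_{\max}}$ is an $\tilde{m}$th order Markov chain). So for $|\vec{q}_n|$
sufficiently small, we have

$$H(X_0(\vec{p}_n) | X_{-n}^{-1}(\vec{p}_n)) \le H(X_{\vec{p}_{\max}}) + (1/2)\, \vec{q}_n^{\,t} K_1 \vec{q}_n.$$

Now we expand $f_n(\vec{p}_n)$ and $g_n(\vec{p}_n)$ around $\vec{p}_n^{\max}$:

$$f_n(\vec{p}_n) = f_n(\vec{p}_n^{\max}) + K_2 \cdot \vec{q}_n + O(|\vec{q}_n|^2),$$

$$g_n(\vec{p}_n) = g_n(\vec{p}_n^{\max}) + K_3 \cdot \vec{q}_n + O(|\vec{q}_n|^2)$$

(here, $K_2$ and $K_3$ are vectors of first order partial derivatives). Then, for
$|\vec{q}_n|$ sufficiently small, we have

$$f_n(\vec{p}_n) \le f_n(\vec{p}_n^{\max}) + 2 \sum_j |K_{2,j}|\, |q_{n,j}|,$$

$$g_n(\vec{p}_n) \le g_n(\vec{p}_n^{\max}) + 2 \sum_j |K_{3,j}|\, |q_{n,j}|,$$

where $K_{2,j}, K_{3,j}, q_{n,j}$ are the $j$th coordinates of $K_2, K_3, \vec{q}_n$, respectively.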

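With a change of coordinates, if necessary, we may assume $K_1$ is a diagonal
matrix with strictly negative diagonal elements $K_{1,j}$. In the following
we assume $0 < \varepsilon < \varepsilon_0$. And we may further assume that for some
$l \ge 1$, $|q_{n,j}| > 4|K_{2,j}/K_{1,j}|\, \varepsilon \log(1/\varepsilon) + 4|K_{3,j}/K_{1,j}|\, \varepsilon$ for $j \le l - 1$, and
$|q_{n,j}| \le 4|K_{2,j}/K_{1,j}|\, \varepsilon \log(1/\varepsilon) + 4|K_{3,j}/K_{1,j}|\, \varepsilon$ for $j \ge l$. Then for each $j \le l - 1$, we
have $(1/2) K_{1,j} q_{n,j}^2 + 2|K_{2,j}|\, |q_{n,j}|\, \varepsilon \log(1/\varepsilon) + 2|K_{3,j}|\, |q_{n,j}|\, \varepsilon < 0$. Thus,

$$\begin{aligned}
H(Z_0(\vec{p}_n, \varepsilon) | Z_{-n}^{-1}(\vec{p}_n, \varepsilon))
&\le H(X_{\vec{p}_{\max}}) + f_n(\vec{p}_n^{\max})\, \varepsilon \log(1/\varepsilon) + g_n(\vec{p}_n^{\max})\, \varepsilon \\
&\quad + \sum_j \left( (1/2) K_{1,j} q_{n,j}^2 + 2|K_{2,j}|\, |q_{n,j}|\, \varepsilon \log(1/\varepsilon) + 2|K_{3,j}|\, |q_{n,j}|\, \varepsilon \right) + C \varepsilon^2 \log(1/\varepsilon) \\
&\le H(X_{\vec{p}_{\max}}) + f_n(\vec{p}_n^{\max})\, \varepsilon \log(1/\varepsilon) + g_n(\vec{p}_n^{\max})\, \varepsilon \\
&\quad + \sum_{j \ge l} (1/2) |K_{1,j}| \left( 4|K_{2,j}/K_{1,j}|\, \varepsilon \log(1/\varepsilon) + 4|K_{3,j}/K_{1,j}|\, \varepsilon \right)^2 \\
&\quad + \sum_{j \ge l} 2|K_{2,j}| \left( 4|K_{2,j}/K_{1,j}|\, \varepsilon \log(1/\varepsilon) + 4|K_{3,j}/K_{1,j}|\, \varepsilon \right) \varepsilon \log(1/\varepsilon) \\
&\quad + \sum_{j \ge l} 2|K_{3,j}| \left( 4|K_{2,j}/K_{1,j}|\, \varepsilon \log(1/\varepsilon) + 4|K_{3,j}/K_{1,j}|\, \varepsilon \right) \varepsilon + C \varepsilon^2 \log(1/\varepsilon).
\end{aligned}$$

Collecting terms, we eventually reach

$$H(Z_0(\vec{p}_n, \varepsilon) | Z_{-n}^{-1}(\vec{p}_n, \varepsilon)) \le H(X_{\vec{p}_{\max}}) + f_n(\vec{p}_n^{\max})\, \varepsilon \log(1/\varepsilon) + g_n(\vec{p}_n^{\max})\, \varepsilon + O(\varepsilon^2 \log^2 \varepsilon),$$

and since $\mathcal{H}_n(\mathcal{S}, \varepsilon)$ is the sup of the left-hand side expression, together with
$H(X_{\vec{p}_{\max}}) = C(\mathcal{S})$, we have

$$\mathcal{H}_n(\mathcal{S}, \varepsilon) \le C(\mathcal{S}) + f_n(\vec{p}_n^{\max})\, \varepsilon \log(1/\varepsilon) + g_n(\vec{p}_n^{\max})\, \varepsilon + O(\varepsilon^2 \log^2 \varepsilon).$$

As discussed in Theorem 2.4, we have

$$f_n(\vec{p}_n^{\max}) = f(X_{\vec{p}_{\max}}), \qquad n \ge 2\tilde{m}, \tag{3.3}$$

and

$$g_n(\vec{p}_n^{\max}) = g(X_{\vec{p}_{\max}}), \qquad n \ge 3\tilde{m}. \tag{3.4}$$

So eventually we reach

$$\mathcal{H}_n(\mathcal{S}, \varepsilon) \le C(\mathcal{S}) + f(X_{\vec{p}_{\max}})\, \varepsilon \log(1/\varepsilon) + g(X_{\vec{p}_{\max}})\, \varepsilon + O(\varepsilon^2 \log^2 \varepsilon).$$

The reverse inequality follows trivially from the definition of $\mathcal{H}_n(\mathcal{S}, \varepsilon)$.
We now prove the statement for $h_m(\mathcal{S}, \varepsilon)$. First, observe that

$$\mathcal{H}_{3m}(\mathcal{S}, \varepsilon) \ge h_m(\mathcal{S}, \varepsilon) \ge h_{\tilde{m}}(\mathcal{S}, \varepsilon) \ge H(Z_{\vec{p}_{\max}, \varepsilon}),$$

where $Z_{\vec{p}_{\max}, \varepsilon}$ is the output process corresponding to input $X_{\vec{p}_{\max}}$.
By part 1, $\mathcal{H}_{3m}(\mathcal{S}, \varepsilon)$ is of the form $C(\mathcal{S}) + f(X_{\vec{p}_{\max}})\, \varepsilon \log(1/\varepsilon) + g(X_{\vec{p}_{\max}})\, \varepsilon + O(\varepsilon^2 \log^2 \varepsilon)$.
By Theorem 2.4, $H(Z_{\vec{p}_{\max}, \varepsilon})$ is of the same form. Thus, $h_m(\mathcal{S}, \varepsilon)$
is also of the same form, as desired. □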

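Corollary 3.3.

$$C(\mathcal{S}, \varepsilon) = C(\mathcal{S}) + (f(X_{\vec{p}_{\max}}) - 1)\, \varepsilon \log(1/\varepsilon) + (g(X_{\vec{p}_{\max}}) - 1)\, \varepsilon + O(\varepsilon^2 \log^2 \varepsilon).$$

In fact, for each $m \ge \tilde{m}(\mathcal{S})$, $h_m(\mathcal{S}, \varepsilon) - H(\varepsilon)$ is of this form.

Proof. This follows from Theorem 3.2, inequality (3.1) and the fact
that

$$H(\varepsilon) = \varepsilon \log 1/\varepsilon + (1 - \varepsilon) \log 1/(1 - \varepsilon) = \varepsilon \log 1/\varepsilon + \varepsilon + O(\varepsilon^2). \qquad \Box$$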
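The two-term expansion of the binary entropy function used in this proof is easy to check numerically. The sketch below (function names are hypothetical, natural logarithms as in the rest of the paper) compares $H(\varepsilon)$ with $\varepsilon \log 1/\varepsilon + \varepsilon$; since $(1-\varepsilon)\log\frac{1}{1-\varepsilon} = \varepsilon - \varepsilon^2/2 - \cdots$, the remainder divided by $\varepsilon^2$ should be close to $-1/2$ for small $\varepsilon$.

```python
import math

def binary_entropy(eps):
    """H(eps) = eps log(1/eps) + (1 - eps) log(1/(1 - eps)), in nats, for 0 < eps < 1."""
    return eps * math.log(1.0 / eps) - (1.0 - eps) * math.log1p(-eps)

def two_term_approximation(eps):
    """Leading terms eps log(1/eps) + eps of the expansion of H(eps)."""
    return eps * math.log(1.0 / eps) + eps

def scaled_remainder(eps):
    """(H(eps) - two_term_approximation(eps)) / eps^2, which stays bounded as eps -> 0."""
    return (binary_entropy(eps) - two_term_approximation(eps)) / eps ** 2
```

Evaluating `scaled_remainder` at a few small values of `eps` gives numbers near $-0.5$, consistent with an $O(\varepsilon^2)$ remainder.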

Remark 3.4. Note that the error term here for noisy constrained capacity
is $O(\varepsilon^2 \log^2 \varepsilon)$, which is larger than the error term, $O(\varepsilon^2 \log \varepsilon)$, for entropy
rate in Theorem 2.4. At least in some cases, this cannot be improved, as
we show at the end of the next section.

4. Binary symmetric channel with $(d,k)$-RLL constrained input. We
now apply the results of the preceding section to compute asymptotics for
the noisy constrained BSC channel with inputs restricted to the $(d,k)$-RLL
constraint $S(d,k)$. Expressions (2.8) and (2.9) allow us to explicitly
compute $f(X_{\vec p_{\max}})$ and $g(X_{\vec p_{\max}})$. In this section, as an example, we derive
the explicit expression for $f(X_{\vec p_{\max}})$, omitting the computation of $g(X_{\vec p_{\max}})$
because the derivation is tedious. We remark that, for a BSC($\varepsilon$) with
$(d,k)$-RLL constrained input, similar expressions have been independently
obtained for some cases in [19].

It is first shown in [19] that in the case $k \le 2d$, for any binary stationary
Markov chain $X$, of any order, with $\mathcal{A}(X) \subseteq S(d,k)$, $f(X) = 1$, and so, in
this case, $C(S(d,k),\varepsilon) = C(S(d,k),0) + O(\varepsilon)$, that is, the noisy constrained
capacity differs from the noiseless capacity by $O(\varepsilon)$, rather than $O(\varepsilon\log\varepsilon)$.
In the following we take a look at this using a different approach. For this,
first note that for any $d$, $k$, $f(X)$ takes the form

\[
f(X) = \sum_{l_1+l_2\le k-1,\ 0\le l_2\le d-1,\ l_1\ge d} p_X(10^{l_1+l_2+1}1)
 + \sum_{l_1+l_2=k,\ l_1\ge d} p_X(10^{l_1}10^{l_2})
 + \sum_{1\le l\le d} p_X(10^{l}). \tag{4.1}
\]

Now, when $k \le 2d$,

\[
\sum_{l_1+l_2=k,\ l_1\ge d} p_X(10^{l_1}10^{l_2}) = \sum_{d\le l_1\le k} p_X(10^{l_1}1) = p_X(1)
\]

and

\[
\sum_{l_1+l_2\le k-1,\ 0\le l_2\le d-1,\ l_1\ge d} p_X(10^{l_1+l_2+1}1)
 = p_X(10^{d+1}) + p_X(10^{d+2}) + \cdots + p_X(10^{k}).
\]

So

\[
f(X) = p_X(1) + p_X(10) + \cdots + p_X(10^{d}) + p_X(10^{d+1}) + \cdots + p_X(10^{k}) = 1,
\]
as desired.
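The identity just derived can be spot-checked numerically by evaluating the three sums in (4.1) directly. The sketch below is not taken from the paper. It assumes a hypothetical input model, namely a stationary renewal process in which the runs of 0s between consecutive 1s are i.i.d. with a user-supplied distribution `q` on the lengths d, ..., k (such a process is supported on S(d,k)); the function name `f_rll` and the dictionary `q` are illustrative choices.

```python
def f_rll(d, k, q):
    """Evaluate the three sums in (4.1) for a renewal process on S(d, k).

    q[l] is the (assumed) probability that a run of 0s between two
    consecutive 1s has length l, for l = d, ..., k.
    """
    mean_cycle = sum((l + 1) * q[l] for l in range(d, k + 1))
    p1 = 1.0 / mean_cycle                     # p_X(1), by the renewal theorem

    def tail(j):
        """Probability that a run of 0s has length at least j."""
        return sum(q[l] for l in range(max(j, d), k + 1))

    # First sum: p_X(1 0^{l1+l2+1} 1) with l1 >= d, 0 <= l2 <= d-1, l1+l2 <= k-1.
    first = sum(p1 * q[l1 + l2 + 1]
                for l2 in range(0, d)
                for l1 in range(d, k - l2))
    # Second sum: p_X(1 0^{l1} 1 0^{l2}) with l1 + l2 = k, l1 >= d.
    second = sum(p1 * q[l1] * tail(k - l1) for l1 in range(d, k + 1))
    # Third sum: p_X(1 0^l) with 1 <= l <= d.
    third = sum(p1 * tail(l) for l in range(1, d + 1))
    return first + second + third

print(f_rll(2, 3, {2: 0.6, 3: 0.4}))             # a case with k below 2d
print(f_rll(1, 3, {1: 1/3, 2: 1/3, 3: 1/3}))     # a case with k above 2d
```

Under this model the first call returns 1 up to rounding, in line with the computation above, while the second call shows that the same sum need not equal 1 once k exceeds 2d.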

Now we consider the general RLL constraint $S(d,k)$. By Corollary 3.3,
we have

\[
C(S(d,k),\varepsilon) = C(S(d,k)) + (f(X_{\vec p_{\max}}) - 1)\,\varepsilon\log 1/\varepsilon
 + (g(X_{\vec p_{\max}}) - 1)\,\varepsilon + O(\varepsilon^2\log^2\varepsilon). \tag{4.2}
\]

For any irreducible finite type constraint, the noiseless capacity and Markov
process of maximal entropy rate can be computed in various ways (which
all go back to Shannon; see [21] or [20], page 444). Let $A$ denote the
adjacency matrix of the standard graph presentation, with $k+1$ states, of
$S(d,k)$. Let $\rho_0$ denote the reciprocal of the largest eigenvalue of $A$. One can write
$C(S(d,k)) = -\log\rho_0$, and in this case $\rho_0$ is the real root of

\[
\sum_{\ell=d}^{k} \rho_0^{\ell+1} = 1. \tag{4.3}
\]
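As a numerical illustration of (4.3), the sketch below solves the equation by bisection and compares the root with the reciprocal of the largest eigenvalue of the adjacency matrix. It is not part of the paper. The graph used is an assumption about the "standard graph presentation": state i records the number of 0s since the last 1, with an edge from i to i+1 for i < k and an edge from i to 0 for i >= d. The function names are ours.

```python
import math

def rho0(d, k, tol=1e-14):
    """Root in (0, 1] of sum_{l=d}^{k} rho**(l+1) = 1, found by bisection."""
    def g(r):
        return sum(r ** (l + 1) for l in range(d, k + 1)) - 1.0
    lo, hi = 0.0, 1.0              # g(0) = -1 < 0 and g(1) = k - d >= 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def largest_eigenvalue(d, k, iters=5000):
    """Power iteration on A + I for the assumed (k+1)-state adjacency matrix A."""
    n = k + 1
    x = [1.0] * n
    lam = 1.0
    for _ in range(iters):
        # y = (A + I) x with A[i][i+1] = 1 for i < k and A[i][0] = 1 for i >= d.
        y = [x[i]
             + (x[i + 1] if i < k else 0.0)
             + (x[0] if i >= d else 0.0)
             for i in range(n)]
        lam = max(y)
        x = [v / lam for v in y]
    return lam - 1.0               # undo the shift by the identity

d, k = 1, 3
r = rho0(d, k)
print("root of (4.3):", r)
print("1 / largest eigenvalue:", 1 / largest_eigenvalue(d, k))
print("noiseless capacity in nats:", -math.log(r))
```

The shift by the identity matrix is a design choice that keeps the power iteration convergent even if the graph happens to be periodic; it does not change the eigenvectors.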

In the following we compute $f(X_{\vec p_{\max}})$ explicitly in terms of $\rho_0$. Let $w =
(w_0, w_1, \ldots, w_k)$ and $v = (v_0, v_1, \ldots, v_k)$ denote the left and right eigenvectors
of $A$. Assume that $w$ and $v$ are scaled such that $w \cdot v = 1$. Then one checks
that, with $X = X_{\vec p_{\max}}$,

Px(l) = "^o^o = ; ■ ■ 1 ; — -, 

(A ; + i)-E,WiECo d_1 VPo" J 

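\[
\begin{aligned}
p_X(10^{l_1}10^{l_2}) &= p_X(10^{l_1}10^{l_2}1) + p_X(10^{l_1}10^{l_2+1}1) + \cdots + p_X(10^{l_1}10^{k}1) \\
&= p_X(1)\rho_0^{l_1+l_2+2}(1 + \rho_0 + \cdots + \rho_0^{k-l_2})
 = p_X(1)\rho_0^{l_1+l_2+2}\,\frac{1-\rho_0^{k-l_2+1}}{1-\rho_0}
\end{aligned}
\]
and

\[
\begin{aligned}
p_X(10^{l}) &= p_X(10^{l}1) + p_X(10^{l+1}1) + \cdots + p_X(10^{k}1) \\
&= p_X(1)\rho_0^{l+1}(1 + \rho_0 + \cdots + \rho_0^{k-l})
 = p_X(1)\rho_0^{l+1}\,\frac{1-\rho_0^{k-l+1}}{1-\rho_0}.
\end{aligned}
\]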

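So we obtain an explicit expression:

\[
\begin{aligned}
f(X_{\vec p_{\max}}) &= \sum_{l_1+l_2\le k-1,\ 0\le l_2\le d-1,\ l_1\ge d} p_X(10^{l_1+l_2+1}1) \\
&\quad + \Bigl( \sum_{l_1=k,\ l_2=0} + \sum_{l_1+l_2=k,\ k-1\ge l_1\ge d} \Bigr) p_X(10^{l_1}10^{l_2})
 + \sum_{1\le l\le d} p_X(10^{l}) \\
&= p_X(1)\rho_0^{k+1} + \sum_{l_1+l_2\le k-1,\ 0\le l_2\le d-1,\ l_1\ge d} p_X(1)\rho_0^{l_1+l_2+2} \\
&\quad + \sum_{l_1+l_2=k,\ k-1\ge l_1\ge d} p_X(1)\rho_0^{l_1+l_2+2}\,\frac{1-\rho_0^{k-l_2+1}}{1-\rho_0} \\
&\quad + \sum_{1\le l\le d} p_X(1)\rho_0^{l+1}\,\frac{1-\rho_0^{k-l+1}}{1-\rho_0}.
\end{aligned}
\]

The coefficient $g$ can also be computed explicitly but takes a more complicated
form.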

Example 4.1. Consider a first order stationary Markov chain $X$ with
$\mathcal{A}(X) \subseteq S(1,\infty)$, transmitted over BSC($\varepsilon$) with the corresponding output $Z$,
a hidden Markov chain. In this case, $X$ can be characterized by the following
probability vector:

\[ \vec p_1 = (p_X(00), p_X(01), p_X(10)). \]

Note that $m(S) = 1$, and the only sequence $w_{-2}w_{-1}v$ which satisfies the
requirement that $w_{-2}w_{-1}$ is in $S$ and $w_{-2}w_{-1}v$ is not allowable in $S$ is $011$.
It then follows that

\[
f(X_{\vec p_1}) = p(\bar{0}11) + p(0\bar{1}1) + p(01\bar{1}) = \pi_{01}(2-\pi_{01})/(1+\pi_{01}), \tag{4.4}
\]

where $\pi_{01}$ denotes the transition probability from 0 to 1 in $X$. Straightforward,
but tedious, computation also leads to

g{Xfr) = (1 + ^ i)~ 1 (2^oi - 71*01 " 27r oi + 37r oi " 4i 
+ (-2 7 r i+47rg 1 -24 1 )log(2) 

+ (-1 + 37T01 - TT^ - 27TQ1 + 57TQ1 - 3tT^) log(TToi) 

+ (2 - 6tt i + 7^ - 8tt^ + 3^) log(l - TToi) 

+ (27T01 + TT 2 m - 37rgi + 74) log(2 - TTOl)). 

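Thus,

\[
H(Z_{\vec p_1,\varepsilon}) = H(X_{\vec p_1}) + (\pi_{01}(2-\pi_{01})/(1+\pi_{01}))\,\varepsilon\log(1/\varepsilon)
 + (g(X_{\vec p_1}) - 1)\,\varepsilon + O(\varepsilon^2\log\varepsilon).
\]

This asymptotic formula was originally proven in [23], with the less precise
result that replaces $(g(X_{\vec p_1}) - 1)\,\varepsilon + O(\varepsilon^2\log^2(1/\varepsilon))$ by $O(\varepsilon)$.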

The maximum entropy Markov chain $X_{\vec p_{\max}}$ on $S(1,\infty)$ is defined by the
transition probability matrix

\[
\begin{bmatrix} 1/\lambda & 1/\lambda^2 \\ 1 & 0 \end{bmatrix}
\]
and

\[ C(S) = H(X_{\vec p_{\max}}) = \log\lambda, \]

where $\lambda$ is the golden mean. Thus, in this case $\pi_{01} = 1/\lambda^2$ and so by
Corollary 3.3, we obtain

\[
C(S,\varepsilon) = \log\lambda - ((2\lambda+2)/(4\lambda+3))\,\varepsilon\log(1/\varepsilon)
 + (g(X_{\vec p_1})|_{\pi_{01}=1/\lambda^2} - 1)\,\varepsilon + O(\varepsilon^2\log^2(1/\varepsilon)).
\]
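The two closed forms used in this example, the noiseless capacity and the coefficient of the leading noise term, can be confirmed numerically. The sketch below is illustrative only: `entropy_rate` is our own specialization of the first order Markov entropy rate to the chain on S(1, infinity), written under the assumption that state 1 always moves to state 0, and `f` is formula (4.4).

```python
import math

lam = (1 + math.sqrt(5)) / 2           # golden mean
pi01 = 1 / lam**2                      # transition probability 0 -> 1 at the maximum

def entropy_rate(p):
    """Entropy rate (nats) of the first order chain on S(1, inf) with pi_01 = p."""
    h = -p * math.log(p) - (1 - p) * math.log(1 - p)
    # The stationary probability of state 0 is 1/(1+p); state 1 contributes
    # no entropy because it moves to state 0 with probability 1.
    return h / (1 + p)

def f(p):
    """Formula (4.4) as a function of pi_01 = p."""
    return p * (2 - p) / (1 + p)

print("entropy rate at the maximum:", entropy_rate(pi01), " log(lambda):", math.log(lam))
print("f - 1:", f(pi01) - 1, " -(2*lam+2)/(4*lam+3):", -(2 * lam + 2) / (4 * lam + 3))
```

Both pairs of printed numbers should agree up to rounding, since the identity lambda^2 = lambda + 1 reduces f - 1 at this point to -(2*lambda+2)/(4*lambda+3).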

We now show that the error term in the above formula cannot be improved,
in the sense that the error term is of size at least a positive constant
times $\varepsilon^2\log^2(1/\varepsilon)$. First observe that if we parameterize $\vec p_1 = \vec p_1(\varepsilon)$ in any
way, we obtain

\[ C(S,\varepsilon) \ge H(Z_{\vec p_1(\varepsilon),\varepsilon}) - H(\varepsilon). \tag{4.5} \]

Since $\vec p_1$ is uniquely determined by the transition probability $\pi_{01}$, we shall
rewrite $\vec p_1(\varepsilon)$ as $\pi_{01}(\varepsilon)$. We shall also rewrite the value of $\pi_{01} = 1/\lambda^2$ at the
maximum entropy Markov chain as $p_{\max}$.

Choose the parametrization $\pi_{01}(\varepsilon) = p_{\max} + \alpha\varepsilon\log(1/\varepsilon)$, where $\alpha$ is
selected as follows. Let $K_1$ denote the value of the second derivative of $H(X_{\pi_{01}})$
at $\pi_{01} = p_{\max}$ (the first derivative at $\pi_{01} = p_{\max}$ is 0). Let $K_2$ denote the value
of the first derivative of $f(X_{\pi_{01}})$ at $\pi_{01} = p_{\max}$. These values can be computed
explicitly: $K_1$ from the formula for entropy rate of a first order Markov chain
(1.3) and $K_2$ from (4.4) above. A computation shows that $K_1 \approx -3.065$ and
$K_2 \approx 0.571$ (all that really matters is that neither constant is 0). Let $\alpha$ be
any number such that $0 < \alpha < K_2/|K_1|$.
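The numerical values quoted for the two derivative constants can be reproduced with finite differences. The sketch below is not the authors' computation: it differentiates our own closed form for the entropy rate of the chain on S(1, infinity) and formula (4.4) by central differences, and the step sizes are arbitrary choices. It also evaluates the admissible range for the scaling constant in the parametrization.

```python
import math

def entropy_rate(p):
    """Entropy rate (nats) of the first order chain on S(1, inf) with pi_01 = p."""
    h = -p * math.log(p) - (1 - p) * math.log(1 - p)
    return h / (1 + p)

def f(p):
    """Formula (4.4) as a function of pi_01 = p."""
    return p * (2 - p) / (1 + p)

def d1(fn, x, h=1e-5):
    """Central-difference estimate of the first derivative."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

def d2(fn, x, h=1e-4):
    """Central-difference estimate of the second derivative."""
    return (fn(x + h) - 2 * fn(x) + fn(x - h)) / h**2

lam = (1 + math.sqrt(5)) / 2
p_max = 1 / lam**2

K1 = d2(entropy_rate, p_max)       # second derivative of the entropy rate
K2 = d1(f, p_max)                  # first derivative of f
alpha_bound = K2 / abs(K1)         # upper end of the admissible range
alpha = alpha_bound / 2            # one admissible choice
leading = alpha * (K1 * alpha + K2)

print("K1 =", K1, " K2 =", K2)
print("admissible range upper end:", alpha_bound)
print("alpha*(K1*alpha + K2) at the midpoint:", leading)
```

Any choice strictly inside the admissible range makes the last printed quantity positive, which is the property the lower-bound argument relies on.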

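From Theorem 2.4 and Remark 2.6, we have

\[
H(Z_{\pi_{01}(\varepsilon),\varepsilon}) \ge H(X_{\pi_{01}(\varepsilon)}) + f(X_{\pi_{01}(\varepsilon)})\,\varepsilon\log(1/\varepsilon)
 + g(X_{\pi_{01}(\varepsilon)})\,\varepsilon + C_1\varepsilon^2\log\varepsilon, \tag{4.6}
\]
for some constant $C_1$ (independent of $\varepsilon$ sufficiently small). We also have

\[
H(X_{\pi_{01}(\varepsilon)}) \ge H(X_{p_{\max}}) + K_1(\alpha\varepsilon\log(1/\varepsilon))^2 + C_2(\alpha\varepsilon\log(1/\varepsilon))^3 \tag{4.7}
\]
for some constant $C_2$. And

\[
f(X_{\pi_{01}(\varepsilon)}) \ge f(X_{p_{\max}}) + K_2(\alpha\varepsilon\log(1/\varepsilon)) + C_3(\alpha\varepsilon\log(1/\varepsilon))^2, \tag{4.8}
\]

\[
g(X_{\pi_{01}(\varepsilon)}) \ge g(X_{p_{\max}}) + C_4(\alpha\varepsilon\log(1/\varepsilon)) \tag{4.9}
\]
for constants $C_3$, $C_4$. And recall that

\[
H(\varepsilon) = \varepsilon\log 1/\varepsilon + (1-\varepsilon)\log 1/(1-\varepsilon) = \varepsilon\log 1/\varepsilon + \varepsilon + C_5\varepsilon^2 \tag{4.10}
\]
for some constant $C_5$.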
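Recalling that $H(X_{p_{\max}}) = C(S)$ and combining (4.5)-(4.10), we see that

\[
\begin{aligned}
C(S,\varepsilon) \ge{}& C(S) + (f(X_{p_{\max}}) - 1)\,\varepsilon\log(1/\varepsilon) + (g(X_{p_{\max}}) - 1)\,\varepsilon \\
& + K_1(\alpha\varepsilon\log(1/\varepsilon))^2 + K_2(\alpha\varepsilon^2\log^2(1/\varepsilon))
\end{aligned}
\]

plus "error terms" which add up to

\[
C_1\varepsilon^2\log\varepsilon + C_2(\alpha\varepsilon\log(1/\varepsilon))^3 + C_3\alpha^2(\varepsilon\log(1/\varepsilon))^3
 + C_4(\alpha\varepsilon^2\log(1/\varepsilon)) + C_5\varepsilon^2,
\]

which is lower bounded by a constant $M$ times $\varepsilon^2\log(1/\varepsilon)$. Thus, we see
that the difference between $C(S,\varepsilon)$ and $C(S) + (f(X_{p_{\max}}) - 1)\,\varepsilon\log(1/\varepsilon) +
(g(X_{p_{\max}}) - 1)\,\varepsilon$ is lower bounded by

\[ \alpha(K_1\alpha + K_2)\,\varepsilon^2\log^2(1/\varepsilon) + M\varepsilon^2\log(1/\varepsilon). \tag{4.11} \]

Since $\alpha > 0$ and $K_1\alpha + K_2 > 0$, for sufficiently small $\varepsilon$, (4.11) is lower
bounded by a positive constant times $\varepsilon^2\log^2(1/\varepsilon)$, as desired.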

APPENDIX 

We first prove (2.11)-(2.13). 

• (2.11) follows trivially from the fact that X is an mth order Markov chain. 

• Now consider (2.12). For $w \in \mathcal{A}(X)$ and $wv \notin \mathcal{A}(X)$,

n—k 

h* +1 (wv) = J2Px(wZ J ~ 1 w- j wZ] +1 v) +p x (wZ n v) 
i=i 

m 

= Ypx{wZl~ l w-jwZ) + iv)+px{wZ n v). 



So 



h* +1 {wv) _ YJjLiPxjwJn 1 w- j w_) +1 v)+p x {w_ 1 n v) 
Px{w) Px{wZ n ) 

(m 
Px {wZ^w-jwZ^vlwZ™^ 1 ) 

+ Px (wZ 1 m v\wZT m ~ 1 ) ) J 
xpxiwZ^ipxiwZllwZT-^PxiwZ™- 1 ))- 1 

_ T^=lPx{wZi~nW-jWZ) +1 v) +Px{wZ\ rn v) 



Px{w_\ 



2m J 
For (2.13), there are two cases. If px(w_™ ) = 0,
h k n+1 (wv) _ T^ZiPxjwZt^-jwZ^v) 

Y^Zt+iPxiwZ^w.jwZ)^) 
= — — i —i ^ = Px\v\ 

En—k i —7 — 1 - — 1 \ y 1 



h k n+l (wv) _ Z^PxiwZl 1 



w -j w -J+l v 



_ S j=l Px (wZj^w-jwZj^v) 
Y, 2 j=iPx(wZ 3 n ~ 1 w^ j wZ 1 j+1 ) 

E|=i (^Zl" 1 w-jwZj+t ) 
Using (2.11)-(2.13), we now proceed to prove (2.14)-(2.16).
For (2.14), we have 

hn + i(wv)p x (w) - h k l (w)p x (wv) 

m ,k { x) 

= h n+l( wv ) ~ J2 h t( w )Px( v \ w ~m) 

wv£A(X) wv£A(X) 

/n-k > 

= \Y.Px (wZi^w-jwZ^v) + p x (wZ n v) 

wvGA(X) \j=l / 

-(n + l-k) Px(wv) 

wveA(X) 
n—k 

~ J2 ^Px{wZ ] ~ l w- j wZ) +1 )px{v\wZ l m ) 

wvGA(X) j=l 

+ {n-k) ^2 px(wv) 
wveA(X) 

n—k 

= Y,Px(wZiT 1 iB-jwZj +1 v) 

wv&A{X) j=l 
n—k

weA(x) j=i wveA(x) 

n—k 

wveA(X) j=l 

(n—k 
3=1 

n-k \ 

3=1 / 
+ Y PxiwllnV)-! 

wzl^veAix) 

n—k 

= H YPx( W ~n~ lid -3 W -j+l V ) 
wv£A{X) j=l 

n—k 

~ YPx( W -n~ l W-jW-)+l v ) 
wvGA(X) j=l 

n—k 

Y YP X ( W -n~ 1 ^-3 W -]+l V ) 

weA(X),wv£A(x) j=i 

+ Y PxiwZifi)-! 
wZlveA(X) 

m 

Y YPx( W -2m™-3 W -)+l V ) 
wzl m eA{X),wzl m viA(X)3=l 

+ Y PxiwZlnV)-!. 

wZ^veAiX) 



For (2.15), we have 



2^ K +1 (wv) log -— 

weA(X),wv<£A(X) PX{W) 

Euk I \ i ^2m+l ( W -2m V ) 
K +1 (WV) log _! 

weA(X),wv£A(X) PX{W-2m) 
EV^ I -3-1- -1 M ^2m+l( w -2m V ) 
^ Px (wj n W- jW _ j+1 v) log — ! 

we^(X),uit;^(X) j=l PX{W_ 2m ) 



III 



E 



x ^ Q g ^2m.+l( W -2TO^) 

For (2.16), we have 

XI ^i(^) lo gPx(>M 
iiwe.A(X) 

+ 2^ h n+l (wv)\og 

PxzM=e(e), Pxz (wv)=e(e) °^ W) 



(n—k 
3=1 



j—l - -l 



+ px{w_ l n v) - (n+ 1 - k)p x (wv)j logpx(v\w-m) 
Pxz(w)=0(£),Pxz(w«)=©(e),Px('"_r ) =0 

E "»«(™')^%£r 
(e + E ) 

/n-fc \ 

y^PxiwZi^w-jwZ^+iv) +Px{wZ l n v)j \ogpxiy\wZln) 
-(n+l-k) PxiwZ^logpxivlwZln) 

+ E ^L+iM 

Pxz('H'Z 1 3nl )=e(e), Pxz (wZ 1 3m v)=e(e),p x (wZT~ 1 )>0 

hl m+ l{wv) 



X lo£ 



^3m( 






x lot 



(n-k-m) Px(w_l n v)logpx(v\w^ n ) 

+ XI [j2p( w ~n~ l w-jwZ] +l v)+px(wZiv)\logp x (v\wZ} n ) 
wveA(X) \j=i ) 

-(n + l-k) PxiwllnV^ogpx^wZln) 
wzlveAiX) 

+ Yl h° 3m+1 (wv) 

Pxz(wZ 1 3m )=e>(e),p xz (wZ 1 3m v)=B(e), Px (wZZ; 1 )>0 

h% m+1 (wv) 

(-m-1) Y Vx{wZ} n v)\ogpx{v\wZl ri ) 
wzlveA(X) 

+ J2 \^f2Pxi W -2m^jWZ} +1 v)+p X (wZl m v)\ 

™zl m veA{x) \j=i / 
x logpxH«C^) 